AITopics | off-policy q-learning

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Neural Information Processing Systems, Dec-25-2025, 22:49:13 GMT

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random data and suboptimal demonstrations, on a range of continuous control tasks.
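
The effect described in the abstract can be illustrated with a small tabular sketch. This is not the authors' BEAR algorithm; it is a toy example under stated assumptions: the function `q_iteration`, the two-state `batch`, and the use of `q_init` as a stand-in for an arbitrary value estimate on never-observed state-action pairs are all hypothetical. It compares a Bellman backup that maximizes over every action with one that maximizes only over actions observed in the batch.

```python
from collections import defaultdict


def q_iteration(transitions, n_actions, gamma=0.9, sweeps=20,
                constrain=True, q_init=0.0):
    """Tabular Q-iteration on a fixed batch of (s, a, r, s_next, done) tuples.

    q_init stands in for whatever value a function approximator would
    assign to (state, action) pairs it was never trained on.
    """
    # Actions observed in each state: a crude proxy for the data support.
    seen = defaultdict(set)
    for s, a, _, _, _ in transitions:
        seen[s].add(a)

    q = defaultdict(lambda: q_init)
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            if done:
                target = r
            else:
                actions = seen[s_next] if constrain else range(n_actions)
                # States with no observed action contribute a value of 0.
                target = r + gamma * max(
                    (q[(s_next, b)] for b in actions), default=0.0)
            q[(s, a)] = target
    return q


# Hypothetical two-state batch: action 1 is never observed in state 1.
batch = [(0, 0, 0.0, 1, False), (1, 0, 1.0, None, True)]

constrained = q_iteration(batch, n_actions=2, q_init=5.0, constrain=True)
unconstrained = q_iteration(batch, n_actions=2, q_init=5.0, constrain=False)

print(constrained[(0, 0)])    # backs up only from the observed action
print(unconstrained[(0, 0)])  # backs up from the never-updated Q(1, 1)
```

In this toy batch the value of Q(1, 1) is never corrected by data, so the unconstrained backup carries it into Q(0, 0), while the constrained backup uses only Q(1, 0), which the batch does update.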

bootstrapping error reduction, name change, off-policy q-learning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


Reviews: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Neural Information Processing Systems, Jan-26-2025, 21:33:36 GMT

Summary: This paper proposes a new algorithm that helps stabilize off-policy Q-learning. The idea is to introduce approximate Bellman updates that are based on constrained actions, sampled only from the support of the training data distribution. The paper shows that the main source of instability is bootstrapping error: the bootstrapping process might use actions that do not lie in the training data distribution. This work shows a way to mitigate this issue.

bootstrapping error reduction, constraint, off-policy q-learning, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)


Reviews: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Neural Information Processing Systems, Jan-26-2025, 21:33:25 GMT

The reviewers were in consensus about the merits of this paper, in particular the value of the proposed approach and the theoretical analysis. Some concerns were raised about the experimental validation but these have been alleviated by the new results and baselines added during rebuttal. Some concerns remain regarding the clarity of the paper. The authors claim to have revised the text but we are not able to see it to validate that it has improved in this respect. The authors are strongly encouraged to put some real effort into improving the clarity of the final version.

bootstrapping error reduction, clarity, off-policy q-learning

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)


Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Neural Information Processing Systems, Oct-10-2024, 21:12:04 GMT

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator.

bootstrapping error reduction, data distribution, off-policy q-learning, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)


Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning

Roderick, Melrose, Manek, Gaurav, Berkenkamp, Felix, Kolter, J. Zico

arXiv.org Artificial Intelligence, Nov-24-2023

A key problem in off-policy Reinforcement Learning (RL) is the mismatch, or distribution shift, between the dataset and the distribution over states and actions visited by the learned policy. This problem is exacerbated in the fully offline setting. The main approach to correct this shift has been through importance sampling, which leads to high-variance gradients. Other approaches, such as conservatism or behavior-regularization, regularize the policy at the cost of performance. In this paper, we propose a new approach for stable off-policy Q-Learning. Our method, Projected Off-Policy Q-Learning (POP-QL), is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error. In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
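
The remark that importance sampling leads to high variance can be made concrete with a short calculation. The sketch below is not part of POP-QL; it assumes a simplified setting in which the target and behavior policies are state-independent, so the per-step ratios are i.i.d. and the variance of their product over a trajectory has a closed form. The function name `importance_weight_variance` and the example probabilities are hypothetical.

```python
def importance_weight_variance(target_probs, behavior_probs, horizon):
    """Exact variance of the product of per-step ratios pi(a) / mu(a).

    Assumes actions are drawn i.i.d. from behavior_probs at every step,
    so the second moment of the product is the per-step second moment
    raised to the power `horizon`, and the mean of the product is 1.
    """
    for p, m in zip(target_probs, behavior_probs):
        if p > 0 and m <= 0:
            raise ValueError("behavior policy must cover the target policy")
    # E_mu[rho^2] = sum_a mu(a) * (pi(a) / mu(a))^2 = sum_a pi(a)^2 / mu(a)
    second_moment = sum(
        p * p / m for p, m in zip(target_probs, behavior_probs) if p > 0)
    return second_moment ** horizon - 1.0


# Hypothetical policies over two actions.
pi = [0.9, 0.1]   # target policy
mu = [0.5, 0.5]   # behavior (data-collection) policy

for h in (1, 5, 10):
    print(h, importance_weight_variance(pi, mu, h))
```

Under these assumptions the variance grows geometrically with the horizon, which is one way to see why trajectory-level importance weights become hard to use as the mismatch between the two policies or the trajectory length increases.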

conference paper, contraction mapping condition, pop-ql, (12 more...)

arXiv.org Artificial Intelligence

2311.14885

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Massachusetts (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)


Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Kumar, Aviral, Fu, Justin, Soh, Matthew, Tucker, George, Levine, Sergey

Neural Information Processing Systems, Mar-19-2020, 01:30:48 GMT

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator.

bootstrapping error reduction, data distribution, off-policy q-learning, (1 more...)

Neural Information Processing Systems

Technology: